68 research outputs found
Sorting suffixes of a text via its Lyndon Factorization
The process of sorting the suffixes of a text plays a fundamental role in
Text Algorithms. It is used, for instance, in the construction of the
Burrows-Wheeler transform and the suffix array, both widely used in several fields
of Computer Science. For this reason, much recent research has been
devoted to finding new strategies to obtain effective methods for such a
sorting. In this paper we introduce a new methodology in which an important
role is played by the Lyndon factorization, so that the local suffixes inside
factors detected by this factorization keep their mutual order when extended to
the suffixes of the whole word. This property suggests a versatile technique
that can easily be adapted to different implementation scenarios. Comment: Submitted to the Prague Stringology Conference 2013 (PSC 2013)
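The property stated in this abstract can be checked on small inputs. The sketch below is illustrative only and is not the sorting method of the paper: it computes the Lyndon factorization with Duval's algorithm (a standard technique, swapped in here as an assumption) and then verifies by brute force that suffixes starting inside the same factor compare the same way locally and globally. The function names are hypothetical.

```python
def lyndon_factorization(s):
    """Return the Lyndon factors of s, computed with Duval's algorithm."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon words as far as possible.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit each copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def local_order_is_preserved(s):
    """Brute-force check: for two positions inside the same Lyndon factor,
    the order of the local suffixes (cut at the factor end) must agree
    with the order of the suffixes of the whole word."""
    start = 0
    for factor in lyndon_factorization(s):
        end = start + len(factor)
        for p in range(start, end):
            for q in range(p + 1, end):
                if (s[p:end] < s[q:end]) != (s[p:] < s[q:]):
                    return False
        start = end
    return True


print(lyndon_factorization("banana"))
print(local_order_is_preserved("abracadabra"))
```

The brute-force check is quadratic per factor and is meant only to make the statement concrete on short words.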
Lightweight LCP Construction for Very Large Collections of Strings
The longest common prefix array is a very advantageous data structure that,
combined with the suffix array and the Burrows-Wheeler transform, makes it possible to
efficiently compute some combinatorial properties of a string useful in several
applications, especially in biological contexts. Nowadays, the input data for
many problems are big collections of strings, for instance the data coming from
"next-generation" DNA sequencing (NGS) technologies. In this paper we present
the first lightweight algorithm (called extLCP) for the simultaneous
computation of the longest common prefix array and the Burrows-Wheeler
transform of a very large collection of strings having any length. The
computation is realized by performing disk data accesses only via sequential
scans, and the total disk space usage never needs more than twice the output
size, excluding the disk space required for the input. Moreover, extLCP can
also compute the suffix array of the strings of the collection without
needing any further data structure. Finally, we test our algorithm on real
data and compare our results with another tool capable of working in external
memory on large collections of strings. Comment: This manuscript version is made available under the CC-BY-NC-ND 4.0
license http://creativecommons.org/licenses/by-nc-nd/4.0/ The final version
of this manuscript is in press in Journal of Discrete Algorithms
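To make the output of such a computation concrete, here is a naive in-memory sketch of the BWT and LCP array of a small collection. It is not extLCP and performs no external-memory scans. It assumes, as a labeled convention, that each string ends with its own end marker, that markers are smaller than every letter, and that they are ordered by string index, so a common prefix never extends over an end marker. The function names are hypothetical.

```python
def common_prefix_length(u, v):
    """Length of the longest common prefix of u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def bwt_and_lcp(strings, marker="$"):
    """Naive BWT and LCP array of a collection (quadratic, illustration only)."""
    # One entry per suffix: (suffix without marker, string index, start position).
    suffixes = [(s[p:], i, p) for i, s in enumerate(strings)
                for p in range(len(s) + 1)]
    # A proper prefix sorts first, which matches a marker smaller than all
    # letters; ties between equal suffixes are broken by string index.
    suffixes.sort(key=lambda t: (t[0], t[1]))
    # The BWT symbol is the character preceding each suffix, or the marker.
    bwt = "".join(strings[i][p - 1] if p > 0 else marker
                  for _, i, p in suffixes)
    lcp = [0] + [common_prefix_length(suffixes[k - 1][0], suffixes[k][0])
                 for k in range(1, len(suffixes))]
    return bwt, lcp


print(bwt_and_lcp(["ab", "b"]))
```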
Cyclic Complexity of Words
We introduce and study a complexity function on words $c_x(n)$, called
\emph{cyclic complexity}, which counts the number of conjugacy classes of
factors of length $n$ of an infinite word $x$. We extend the well-known
Morse-Hedlund theorem to the setting of cyclic complexity by showing that a
word is ultimately periodic if and only if it has bounded cyclic complexity.
Unlike most complexity functions, cyclic complexity distinguishes between
Sturmian words of different slopes. We prove that if $x$ is a Sturmian word and
$y$ is a word having the same cyclic complexity of $x$, then up to renaming
letters, $x$ and $y$ have the same set of factors. In particular, $y$ is also
Sturmian of slope equal to that of $x$. Since $c_x(n)=1$ for some $n\geq 1$
implies $x$ is periodic, it is natural to consider the quantity
$\liminf_{n\to\infty} c_x(n)$. We show that if $x$ is a Sturmian word,
then $\liminf_{n\to\infty} c_x(n)=2$. We prove however that this is
not a characterization of Sturmian words by exhibiting a restricted class of
Toeplitz words, including the period-doubling word, which also verify this same
condition on the limit infimum. In contrast we show that, for the Thue-Morse
word $t$, $\liminf_{n\to\infty} c_t(n)=+\infty$. Comment: To appear in Journal of Combinatorial Theory, Series
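The definition can be tried out on finite prefixes. The following sketch counts the conjugacy classes of the length-n factors that occur in a finite word, which approximates the cyclic complexity of an infinite word only when the prefix is long enough to contain all its factors of that length. The helper names and the prefix length are assumptions made for illustration.

```python
def canonical_rotation(u):
    """Smallest rotation of u, used as the representative of its conjugacy class."""
    return min(u[i:] + u[:i] for i in range(len(u)))


def cyclic_complexity(word, n):
    """Number of conjugacy classes among the factors of length n of a finite word."""
    factors = {word[i:i + n] for i in range(len(word) - n + 1)}
    return len({canonical_rotation(u) for u in factors})


def fibonacci_word(min_length):
    """Prefix of the Fibonacci word (a Sturmian word) of length >= min_length."""
    a, b = "0", "01"
    while len(b) < min_length:
        a, b = b, b + a
    return b


fib = fibonacci_word(200)
print([cyclic_complexity(fib, n) for n in range(1, 4)])
# A periodic word: all factors of length 4 of (ab)^10 are conjugate.
print(cyclic_complexity("ab" * 10, 4))
```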
Universal Lyndon Words
A word $w$ over an alphabet $\Sigma$ is a Lyndon word if there exists an
order defined on $\Sigma$ for which $w$ is lexicographically smaller than all
of its conjugates (other than itself). We introduce and study \emph{universal
Lyndon words}, which are words over an $n$-letter alphabet that have length
$n!$ and such that all the conjugates are Lyndon words. We show that universal
Lyndon words exist for every $n$ and exhibit combinatorial and structural
properties of these words. We then define particular prefix codes, which we
call Hamiltonian lex-codes, and show that every Hamiltonian lex-code is in
bijection with the set of the shortest unrepeated prefixes of the conjugates of
a universal Lyndon word. This allows us to give an algorithm for constructing
all the universal Lyndon words. Comment: To appear in the proceedings of MFCS 201
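The definition of a universal Lyndon word can be tested directly on short words. The sketch below is a brute-force checker, not the construction algorithm of the paper: it tries every order of the alphabet for every conjugate. It assumes the alphabet is exactly the set of letters occurring in the word; the function names and the example word 123132 are chosen here for illustration.

```python
from itertools import permutations
from math import factorial


def is_lyndon(w, order):
    """True if w is strictly smaller than all its other rotations
    under the given order of the alphabet (a tuple of letters)."""
    rank = {c: r for r, c in enumerate(order)}

    def key(u):
        return [rank[c] for c in u]

    return all(key(w) < key(w[i:] + w[:i]) for i in range(1, len(w)))


def is_universal_lyndon(w):
    """True if w has length n! over its n letters and every conjugate of w
    is a Lyndon word for at least one order of the alphabet."""
    alphabet = sorted(set(w))
    if len(w) != factorial(len(alphabet)):
        return False
    orders = list(permutations(alphabet))
    return all(any(is_lyndon(w[i:] + w[:i], order) for order in orders)
               for i in range(len(w)))


print(is_universal_lyndon("123132"))
print(is_universal_lyndon("abcabc"))
```

For 123132 each of the six orders on {1, 2, 3} selects a different conjugate as the smallest one, while abcabc is not primitive and so has no Lyndon conjugate at all.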
On the Impact of Morphisms on BWT-Runs
Morphisms are widely studied combinatorial objects that can be used for generating infinite families of words. In the context of Information theory, injective morphisms are called (variable length) codes. In Data compression, morphisms, combined with parsing techniques, have recently been used to define new mechanisms to generate repetitive words. Here, we show that the repetitiveness induced by applying a morphism to a word can be captured by a compression scheme based on the Burrows-Wheeler Transform (BWT). In fact, we prove that, unlike other compression-based repetitiveness measures, the measure r_bwt (which counts the number of equal-letter runs produced by applying BWT to a word) strongly depends on the applied morphism. In more detail, we characterize the binary morphisms that preserve the value of r_bwt(w), when applied to any binary word w containing both letters. They are precisely the Sturmian morphisms, which are well-known objects in Combinatorics on words. Moreover, we prove that it is always possible to find a binary morphism that, when applied to any binary word containing both letters, increases the number of BWT equal-letter runs by a given (even) number. In addition, we derive a method for constructing arbitrarily large families of binary words on which BWT produces a given (even) number of new equal-letter runs. Such results are obtained by using a new class of morphisms that we call Thue-Morse-like. Finally, we show that there exist binary morphisms μ for which it is possible to find words w such that the difference r_bwt(μ(w))-r_bwt(w) is arbitrarily large.
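A small experiment illustrates the measure r_bwt. The sketch below assumes the BWT is taken as the last column of the sorted rotations of the word, with no end marker, and uses two well-known morphisms chosen here as examples: the Fibonacci morphism (a -> ab, b -> a), which is Sturmian, and the Thue-Morse morphism (a -> ab, b -> ba), which is not. All names are hypothetical.

```python
def bwt(w):
    """BWT as the last column of the sorted rotations of w (no end marker)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


def r_bwt(w):
    """Number of maximal equal-letter runs in the BWT of w."""
    t = bwt(w)
    return sum(1 for i in range(len(t)) if i == 0 or t[i] != t[i - 1])


def apply_morphism(morphism, w):
    """Replace every letter of w by its image under the morphism."""
    return "".join(morphism[c] for c in w)


FIBONACCI = {"a": "ab", "b": "a"}     # Sturmian morphism (example choice)
THUE_MORSE = {"a": "ab", "b": "ba"}   # not Sturmian (example choice)

for w in ("ab", "abba"):
    print(w, r_bwt(w),
          r_bwt(apply_morphism(FIBONACCI, w)),
          r_bwt(apply_morphism(THUE_MORSE, w)))
```

On these two words the Fibonacci morphism leaves the number of runs unchanged, while the Thue-Morse morphism maps ab to abba and raises the count from 2 to 4.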
Detecting Mutations by eBWT
In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (EBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the EBWT.
Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNP discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the EBWT of the reads collection, and we develop a tool that finds SNPs with a simple scan of the EBWT and LCP arrays.
Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.
Lightweight Reference-Free Variation Detection using the Burrows-Wheeler Transform
A New Class of Searchable and Provably Highly Compressible String Transformations
The Burrows-Wheeler Transform is a string transformation that plays a fundamental role in the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domain of strings. However, efforts to find non-trivial alternatives to the original, now 25 years old, Burrows-Wheeler string transformation have met with limited success. In this paper we bring new life to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class also includes the Alternating BWT, another invertible string transform recently introduced in connection with a generalization of Lyndon words.
Computing the original eBWT faster, simpler, and with less memory
Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the
BWT to a collection of strings. However, since its introduction, the term has been
used more generally to describe any BWT of a collection of strings and the
fundamental property of the original definition (i.e., the independence from
the input order) is frequently disregarded. In this paper, we propose a simple
linear-time algorithm for the construction of the original eBWT, which does not
require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we
obtain the first linear-time algorithm for computing the BWT of a single string
that uses neither an end-of-string symbol nor Lyndon rotations. We combine our
new eBWT construction with a variation of prefix-free parsing to allow for
scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets
of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate
that it is the fastest method for all collections, with a maximum speedup of
7.6x over the second-best method. The peak memory is at most 2x larger than that of the
second-best method. Compared with methods that, like our algorithm, are
able to report suffix array samples, we obtain a 57.1x improvement in peak
memory. The source code is publicly available at
https://github.com/davidecenzato/PFP-eBWT. Comment: 20 pages, 5 figures, 1 table
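The independence from the input order mentioned in this abstract can be observed with a naive sketch of the original eBWT definition. This is not pfpebwt and is far from linear time: it sorts all rotations of all strings by comparing their infinite repetitions, truncated to twice the longest string length (enough to decide the order of two different infinite repetitions), and reads off the last characters. It assumes non-empty input strings, and the function name is hypothetical.

```python
def ebwt(strings):
    """Naive eBWT of a collection of non-empty strings.

    Rotations are sorted by omega-order, i.e. by comparing u^omega and
    v^omega; prefixes of length 2 * max_len are long enough to decide."""
    limit = 2 * max(len(s) for s in strings)
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]

    def omega_key(u):
        # Prefix of length `limit` of the infinite repetition of u.
        return (u * (limit // len(u) + 1))[:limit]

    rotations.sort(key=omega_key)
    # Rotations with equal keys are powers of the same root and share their
    # last character, so ties do not affect the output.
    return "".join(u[-1] for u in rotations)


print(ebwt(["ab", "b"]))
print(ebwt(["b", "ab"]))
```

Both calls produce the same string, since the sort depends only on the multiset of rotations and not on the order in which the strings are listed.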